University of London

BSc Computer Science (Artificial Intelligence & Machine Learning)

Student ID: 200628880

Abstract

An attacker's purpose is to obtain the victim's personal information or user credentials, or to install malware on their devices. Researchers have proposed a number of strategies and techniques to overcome this problem, but machine learning-based detection outshines them all. This report proposes an approach based on examining the URLs' lexical features. A model is trained, tested, and evaluated with seven different machine learning algorithms. In terms of accuracy, the Random Forest classifier outperforms the other classifiers.

Malicious Uniform Resource Locator (URLs) Analysis & Detection using Machine Learning Techniques

Data Retrieval

The data was retrieved from Kaggle.com. It is a large dataset of 651,191 URLs, of which 428,103 are benign (safe) URLs, 96,457 are defacement URLs, 94,111 are phishing URLs, and 32,520 are malware URLs.

Data Exploration

The dataframe has 651,191 rows and 2 columns.

There are no null values in any of the columns. All the columns are of type object.

About 65.74% of the URLs in this dataset are safe and 34.25% are malicious. There are therefore far more benign (safe) URLs than malicious ones, which is an acceptable class imbalance.
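A minimal sketch of this exploration step, using a tiny in-memory frame in place of the Kaggle CSV; the column names "url" and "type" are assumptions about the dataset's layout:

```python
import pandas as pd

# Tiny stand-in for the 651,191-row Kaggle frame; same two-column shape.
df = pd.DataFrame({
    "url": ["example.com", "phish.biz/login", "bad.ru/malware.exe", "safe.org"],
    "type": ["benign", "phishing", "malware", "benign"],
})

# Class distribution in percent, as reported in the section above.
counts = df["type"].value_counts()
share = counts / len(df) * 100
print(share)
```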

Feature Engineering

Theoretically, "https" is more secure than "http", because "https" uses data encryption and scrambles the data before transmission. However, in the current data, some of the phishing and malware URLs also use "https".

Therefore, "https" site still can be hacked and does not confirm that the site is legitimate.

The URLs are parsed to extract the protocol and check whether it is present.
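A sketch of this protocol check using Python's standard urllib.parse; the helper name has_https is illustrative, not the notebook's actual function:

```python
from urllib.parse import urlparse

def has_https(url: str) -> int:
    """Return 1 if the URL's scheme is https, else 0."""
    # Many dataset entries lack a scheme entirely; urlparse then leaves
    # .scheme empty, so such URLs count as non-https.
    return 1 if urlparse(url).scheme == "https" else 0

print(has_https("https://example.com/login"))  # 1
print(has_https("example.com/path"))           # 0 (no scheme present)
```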

A URL shortening service condenses web addresses. An app of this kind, also known as a link shortener, redirects the shorter URL to the original.

In this context, however, URL shortening services are a great resource for spammers and hackers to get hold of a victim's computer, by sharing a link and fooling the user into clicking it.

The attacker can then either install a virus on the victim's computer or harvest important user credentials through the fake URL. This clearly breaches the privacy and security of the user.

For the shortening-service feature, "1" means the URL does not use any shortening service, whereas "-1" means it does.
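One possible implementation of this feature, assuming a regex over a small, non-exhaustive list of shortener domains (the real feature list is likely much longer); the 1 / -1 encoding follows the report:

```python
import re

# Illustrative subset of shortening-service domains.
SHORTENERS = re.compile(
    r"(bit\.ly|goo\.gl|tinyurl\.com|t\.co|ow\.ly|is\.gd)", re.IGNORECASE
)

def shortening_service(url: str) -> int:
    """Return -1 if the URL uses a known shortening service, else 1."""
    return -1 if SHORTENERS.search(url) else 1

print(shortening_service("http://bit.ly/3xYz"))            # -1
print(shortening_service("https://www.example.com/page"))  # 1
```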

Heat map

A correlation matrix denotes the correlation coefficients between variables at the same time. A heat map grid can represent these coefficients to build a visual representation of the variables’ dependence. This visualization makes it easy to spot the strong dependencies.

A correlation coefficient close to +1 indicates a strong direct dependency.

A coefficient close to -1 indicates a strong inverse dependency, while a coefficient close to zero indicates weak dependence.
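A sketch of how such a matrix can be computed and handed to a heatmap; the feature names here are synthetic stand-ins for the engineered URL features:

```python
import numpy as np
import pandas as pd

# Synthetic features standing in for the engineered URL features.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "url_length": rng.integers(10, 200, 100),
    "digit_count": rng.integers(0, 30, 100),
})
# Deliberately derived from url_length, so it correlates strongly with it.
features["special_chars"] = features["url_length"] // 10

corr = features.corr()
print(corr.round(2))

# In the notebook, this matrix would typically be visualized with e.g.
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```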

At this point we can tell that the longest URL in the dataset is a benign URL, whereas the shortest is a phishing URL.

Plots and graphs are displayed to find out how the data is distributed and how features are related to each other.

We shuffle the data to balance out the class distribution when splitting it into training and testing sets. This also reduces the risk of overfitting during model training.

Splitting the Data

Now that the data wrangling is complete, the data needs to be split.

There are several methods for dividing the dataset into training and test sets. The method adopted in this project uses each instance's identifier to determine whether or not it should be included in the test set. With this solution we only need to ensure that new data is appended at the end of the dataset and that no row is deleted. This has the advantage that the test set remains stable even if the program is run again.

Now that we have the test and training sets, we need to separate each into data and labels. The label, our "y" value, contains only the type column, because that is what we want to predict. The data, our "x" values, consists of all the other attributes.
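The identifier-based split described above can be sketched as follows, hashing each row's index with CRC32 so that test-set membership is stable across runs; the label column name "type" follows the dataset, the other names are illustrative:

```python
from zlib import crc32

import numpy as np
import pandas as pd

def in_test_set(identifier: int, test_ratio: float) -> bool:
    # A stable hash of the identifier decides test membership, so rows
    # keep their assignment even when new rows are appended at the end.
    return crc32(np.int64(identifier)) & 0xFFFFFFFF < test_ratio * 2**32

def split_by_id(df: pd.DataFrame, test_ratio: float = 0.2):
    mask = df.index.to_series().apply(lambda i: in_test_set(i, test_ratio))
    return df.loc[~mask], df.loc[mask]

# Hypothetical frame with the label column "type", as in the dataset.
df = pd.DataFrame({"url_length": range(1000),
                   "type": ["benign", "phishing"] * 500})
train, test = split_by_id(df)

# Separate data ("x") from label ("y").
X_train, y_train = train.drop(columns="type"), train["type"]
print(len(train), len(test))
```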

Data Scaling

Scaling is important because, if the data is not scaled, the model may make erroneous predictions.

"Training Set" feature scaling

As with all transformations, we fit the scalers only to the training data, not to the entire dataset (which includes the testing set). The training data is partitioned into numerical and categorical values before being combined in a single pipeline.

*StandardScaler*

The standard score of a sample x is calculated as: z = (x - u) / s

where "u" is the mean of the training samples or zero if with_mean=False

"s" is the standard deviation of the training samples or one if with_std=False

Centering and scaling happen independently on each feature, by computing the relevant statistics on the samples in the training set.

The mean and standard deviation are then stored and applied to later data via transform.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not look more or less like standard, normally distributed data.

"Testing Set" feature scaling

We must run the entire pipeline on the test set as well, but only transform() it, based on the fit performed on the training set. The test set should not be fit_transform()ed.
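A minimal sketch of this fit-on-train / transform-on-test discipline with scikit-learn's StandardScaler, using toy numbers rather than the URL features:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[20.0]])

pipeline = Pipeline([("scaler", StandardScaler())])

# fit_transform on the TRAINING set: learns u and s, then scales.
X_train_scaled = pipeline.fit_transform(X_train)

# transform ONLY on the test set: reuses the training u and s.
X_test_scaled = pipeline.transform(X_test)

# z = (x - u) / s with u = 20, so the test value scales to exactly 0.
print(X_test_scaled)
```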

Investigate and train various machine learning models

This is a classification problem, so the classification machine learning models listed below will be used and evaluated to see which outperforms the rest.

Logistic Regression

Logistic Regression is used here in a one-vs-rest configuration: the multi-class dataset is split into multiple binary classification problems, and the method is easy to implement and interpret. It is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome. The intention behind using logistic regression is to find the best-fitting model to describe the relationship between the dependent and the independent variables.

A confusion matrix is generated to better visualize how the basic Logistic Regression model performed on our test set.

To make the class labels easier to interpret, a second, labelled confusion matrix was plotted.

With the Logistic Regression model, the cross-validation accuracy on the training set is 86.710%, and the model took about 33 seconds to run. On the testing set, 86.630% of instances are correctly classified.
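A sketch of this evaluation loop for the logistic regression baseline, using synthetic four-class data in place of the URL features; cross_val_predict yields out-of-fold predictions from which the confusion matrix is built:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic stand-in for the four-class URL data (benign, defacement,
# phishing, malware).
X, y = make_classification(n_samples=500, n_classes=4,
                           n_informative=6, random_state=42)

log_reg = LogisticRegression(max_iter=1000)

# Cross-validation accuracy on the training data.
scores = cross_val_score(log_reg, X, y, cv=3, scoring="accuracy")

# Out-of-fold predictions, so the confusion matrix is honest.
y_pred = cross_val_predict(log_reg, X, y, cv=3)
cm = confusion_matrix(y, y_pred)  # rows: true class, columns: predicted
print(round(scores.mean(), 3))
print(cm)
```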

Decision Tree Classifier

Decision trees are popular models for classification and regression problems. They essentially learn a hierarchy of if/else questions that leads to a decision. Learning a decision tree means learning the sequence of if/else questions that gets us to the correct answer as quickly as possible. In machine learning these "questions" are called tests (not to be confused with the test set, which is the data we use to check how well the model generalizes). To build a tree, the algorithm runs through all possible tests and selects the one that provides the most information about the target variable. The Decision Tree Classifier is built on decision trees, with the branches representing observations and the leaves representing conclusions or class labels. It performs well on multi-class classification problems. Its only disadvantage is that it is prone to overfitting, because it keeps generating nodes to fit the data and does not generalize; this is why hyperparameters such as the maximum depth can help.

The basic trained Decision Tree is then used to predict on the test set.

With the Decision Tree Classifier, the cross-validation accuracy on the training set is 91.816%, and the model took about 18 seconds to run. On the testing set, 91.910% of instances are correctly classified.

Random Forest

Random Forest is an ensemble learning technique that generates several decision trees and performs classification with each of them. The final prediction is the class output most often by the individual classification trees. Random forests are currently among the most popular machine learning approaches for regression and classification. A random forest is essentially a collection of decision trees, each of which differs slightly from the others; the idea is that each tree may be relatively good at predicting on its own, and their combination is better. Like neural nets, random forests provide estimates of variable importance. They also offer an improved way of dealing with missing data: missing values are replaced by the value that appears most often in a given node. Random forests are among the most accurate classification approaches available. The technique can also handle large amounts of data with thousands of variables, and it can automatically balance data sets when one class is rarer than the others. It also handles variables quickly, making it appropriate for complex tasks.

With the Random Forest, the cross-validation accuracy on the training set is 92.240%, and the model took about 138 seconds to run. On the testing set, 92.299% of instances are correctly classified.

Multilayer Perceptrons (MLPs)

Multilayer perceptrons (MLPs) are also known as (vanilla) feed-forward neural networks, or sometimes just neural networks, and can be applied to both classification and regression problems. An MLP is a neural network connecting multiple layers in a directed graph, which means that the signal path through the nodes only goes one way. Each node, apart from the input nodes, has a nonlinear activation function. An MLP uses backpropagation as a supervised learning technique, and since there are multiple layers of neurons, it is a deep learning technique. MLPs can be viewed as generalizations of linear models that perform multiple stages of processing to reach a decision.

With the Multilayer Perceptron, the cross-validation accuracy on the training set is 91.215%, and the model took about 556 seconds to run. On the testing set, 91.406% of instances are correctly classified.

Gradient Boosting Classifier

The Gradient Boosting Classifier is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models. The technique builds the model in a stage-wise fashion and generalizes it by allowing the optimization of an arbitrary differentiable loss function. Gradient boosting combines weak learners into a single strong learner in an iterative fashion: as each weak learner is added, a new model is fitted to provide a more accurate estimate of the response variable. The new weak learners are maximally correlated with the negative gradient of the loss function of the whole ensemble. Unlike other ensemble techniques, gradient boosting builds a series of trees in which each tree tries to correct the mistakes of its predecessor. It is a very powerful technique for building predictive models: it is applicable to many different loss functions and optimizes prediction accuracy for those functions, which is an advantage over conventional fitting methods.

With the Gradient Boosting Classifier, the cross-validation accuracy on the training set is 89.883%, and the model took about 946 seconds to run. On the testing set, 89.817% of instances are correctly classified.

Naive Bayes

With Naive Bayes, the cross-validation accuracy on the training set is 69.875%, and the model took about 6 seconds to run. On the testing set, 78.900% of instances are correctly classified.

Stochastic Gradient Descent Classifiers

With the Stochastic Gradient Descent classifier, the cross-validation accuracy on the training set is 85.894%, and the model took about 556 seconds to run. On the testing set, 85.763% of instances are correctly classified.

Compare the Training Scores

After all the training results are in, it is time to put them together, visualize them, and make comparisons.
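The comparison could be assembled with a loop like the following, shown here on synthetic data and a subset of the seven models; accuracies and timings will of course differ from the report's figures:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four-class URL data.
X, y = make_classification(n_samples=600, n_classes=4,
                           n_informative=6, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "NaiveBayes": GaussianNB(),
    "SGD": SGDClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    start = time.time()
    acc = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    results[name] = (round(acc, 3), round(time.time() - start, 2))

# Print in descending order of accuracy, like the report's bar chart.
for name, (acc, secs) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:20s} accuracy={acc:.3f} time={secs}s")
```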

Based on the table shown above, the Random Forest model achieves the highest accuracy among all the models, at 92.23%.

The table also shows that Random Forest has the highest accuracy, precision, and recall scores of all the models.

The bar chart above is sorted in descending order of accuracy.

Another way to compare the models is by their training time. The table shows that Naive Bayes took the shortest time to train, whereas the Gradient Boosting Classifier took the longest.

Increase the accuracy by fine-tuning the models

RandomizedSearchCV is more efficient for hyper-parameter optimization than exhaustive grid trials. First, a parameter grid is established from which to sample during the random search training. Only a few of the most critical random forest parameters were picked. Because the goal is to improve the model's accuracy, increasing the number of trees is the most direct option; however, it may slow the model down.
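A sketch of the randomized search over a few Random Forest hyper-parameters; the ranges are illustrative, not the project's exact grid, and the data is a synthetic stand-in:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the four-class URL data.
X, y = make_classification(n_samples=300, n_classes=4,
                           n_informative=6, random_state=42)

# Illustrative distributions over a few critical parameters.
param_dist = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(3, 20),
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5,            # sample 5 random combinations, not the full grid
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```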
Voting Classifier (Ensemble)

Apart from the randomized search method, another way to fine-tune the models is to combine the ones that performed best; such an ensemble tends to perform better. The models chosen are the three with the highest accuracy: Random Forest, Decision Tree, and Multilayer Perceptron.
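A minimal sketch of this ensemble, combining the three models with a hard (majority) vote; the hyper-parameters here are defaults rather than the tuned values, and the data is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four-class URL data.
X, y = make_classification(n_samples=400, n_classes=4,
                           n_informative=6, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("mlp", MLPClassifier(max_iter=500, random_state=42)),
    ],
    voting="hard",  # majority vote over the three predicted classes
)
voting_clf.fit(X, y)
print(voting_clf.predict(X[:5]))
```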

Comparison of the fine-tuned models

When we compare the basic models with the randomized-search versions, we can see that the randomized search outperformed or matched the basic models.

When all three models are combined, the ensemble performs admirably.

Conclusion

This project can be considered a success, having achieved an overall accuracy score of 93% using the Random Forest Classifier.